Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add API to download all files by dataset #4529 #7086

Merged
merged 6 commits into from
Jul 29, 2020
Merged

Conversation

pdurbin
Copy link
Member

@pdurbin pdurbin commented Jul 14, 2020

What this PR does / why we need it:

API users have long wanted a way to download all the files in a dataset using its persistent identifier (DOI or Handle).

Which issue(s) this PR closes:

Closes #4529

Special notes for your reviewer:

  • One thought I had just now is that there's no way to specify original format vs archival format. Should this be added? I added docs and tests for format=original in 44cc815.
  • I made no attempt to address File API download bypasses terms of use and guestbook #2911 which is about how terms and guestbook acceptance can by bypassed via API. In my testing, I was able to bypass a guestbook with required fields for name and email.
  • I assume that Make Data Count is covered deeper in the API than I went (by existing APIs I'm calling into, I mean).

Suggestions on how to test this:

Follow the API docs. Try variations with restricted files, tabular files, guestbook/terms.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Yes, included.

Additional documentation:

Included.

@coveralls
Copy link

coveralls commented Jul 14, 2020

Coverage Status

Coverage decreased (-0.05%) to 19.605% when pulling fc7df59 on 4529-download-all-api into c1a3834 on develop.

@djbrooke djbrooke added this to the Dataverse 5 milestone Jul 14, 2020
@landreev landreev self-requested a review July 14, 2020 20:20
@landreev landreev self-assigned this Jul 14, 2020
@djbrooke djbrooke moved this from Code Review 🦁 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 15, 2020
@djbrooke djbrooke assigned pdurbin and unassigned landreev Jul 15, 2020
@pdurbin
Copy link
Member Author

pdurbin commented Jul 15, 2020

Ok, I'm ready for more review. I updated the description above to reflect that I added docs and tests for format=original (no code changes required).

I also did a quick test of a dataset with a guestbook. I made no attempt to address #2911 which is about how guestbooks and terms can be bypassed via API. The UI says name and email are required (screenshot below) but the API continues to let you have the files.

Screen Shot 2020-07-15 at 4 19 07 PM

@pdurbin pdurbin removed their assignment Jul 15, 2020
@pdurbin pdurbin moved this from IQSS Team - In Progress 💻 to Code Review 🦁 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 15, 2020
@landreev landreev self-assigned this Jul 16, 2020
Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why we weren't handling that "format=original" parameter as an obvious @QueryParam; finding it instead by going through UriInfo.getQueryParameters()... Probably not worth the trouble of changing it; seeing how it's been like that forever, in the /access/datafiles/ - it's just strange.

Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be nitpicking - but I'm not sure I like "downloadAll" as the name of the API. Should it be more self-explanatory, and along the lines of the existing APIs?
/api/access/datafile/<id> downloads a datafile;
/api/access/dataset/<id> downloads the files in a dataset?
(I'm open to hearing other opinions on this)

@scolapasta
Copy link
Contributor

I'm in favor of: /api/access/dataset/

@landreev
Copy link
Contributor

I'm not requesting this as a change, in this PR, but just want to put this on record: there is a potential optimization that can be added - instead of doing a full isAccessAuthorized() lookup for every file, it is possible to cache some permission information (for ex. - you don't really need to figure out whether this user has the permission to view this unpublished dataset every time from scratch). Which could in certain cases result in non-trivial savings... I remember this because we used to do that, in /api/access/datafiles/... Until we realized that even though that API call was originally created for downloading multiple files from the dataset page, arbitrary file ids spanning multiple datasets could be supplied to it directly - so extrapolating any file permission info, under the assumption that all the other files are within the same dataset was obviously wrong and dangerous. But that code is still in github history, and now that we have a call where staying within the same dataset is guaranteed, we could potentially revive it.
It's not 100% clear to me whether it's worth bothering with - so I'm going to open a new issue for it; and we could decide later on whether we want to touch this.

@pdurbin pdurbin assigned pdurbin and unassigned landreev Jul 17, 2020
@pdurbin pdurbin moved this from Code Review 🦁 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 17, 2020
@pdurbin
Copy link
Member Author

pdurbin commented Jul 17, 2020

Ok, in 4da877b I renamed the endpoint from "downloadAll" to "dataset". Back to code review.

@pdurbin pdurbin removed their assignment Jul 17, 2020
@pdurbin pdurbin moved this from IQSS Team - In Progress 💻 to Code Review 🦁 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 17, 2020
@landreev landreev self-assigned this Jul 21, 2020
@kcondon kcondon self-assigned this Jul 21, 2020
@kcondon
Copy link
Contributor

kcondon commented Jul 27, 2020

@pdurbin issues found so far:

  1. Error handling is sparse -seems to just say "error" no matter what the error is: lack of perms, bad arg, bad key, etc.

No information when attempting to download something you do not have perms for, just says "error". In single file download, says:
{"status":"ERROR","code":403,"message":"'/api/v1/access/datafile/53822' you are not authorized to access this object via this api endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org."}

@pdurbin
Copy link
Member Author

pdurbin commented Jul 27, 2020

ThrowableHandler

This was recently changed in #7085

@qqmyers
Copy link
Member

qqmyers commented Jul 27, 2020

ThrowableHandler was added prior to #7085. When it was added, it stopped the default handing for exceptions such as javax.ws.rs.NotAcceptableException . What #7085 did was to add more exception/http code-specific handlers to manage the ones Dataverse source throws so that they don't get caught be the new default ThrowableHandler. My guess is that javax.ws.rs.NotAcceptableException is being thrown by the framework code rather than Dataverse source code, so I didn't add a specific handler for it. The fix is probably to just add yet-another handler for this specific exception and just send a 406 response as intended. The Redirect handler is probably a good example.

@kcondon
Copy link
Contributor

kcondon commented Jul 27, 2020

@qqmyers Should it be a separate ticket?

@qqmyers
Copy link
Member

qqmyers commented Jul 27, 2020

@kcondon -yeah - not related to this PR that I can see. Also - is this 406 error from a test? If so, it might also make sense to fix the test (unless we really wanted to test 406 instead of seeing if the dcterms metadata can be downloaded.

@kcondon
Copy link
Contributor

kcondon commented Jul 27, 2020

@qqmyers I'm not sure what's causing it. It appears to happen on it's own.

@qqmyers
Copy link
Member

qqmyers commented Jul 27, 2020

FWIW - datasets/export can produce "application/xml", "application/json", or "application/html". My guess is that if someone makes an API call requesting something other than one of those, it's a 406.

@kcondon kcondon assigned pdurbin and unassigned kcondon Jul 27, 2020
@kcondon kcondon moved this from Code Review 🦁 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 27, 2020
@kcondon
Copy link
Contributor

kcondon commented Jul 27, 2020

@pdurbin This works but since it only ever seems to say "error" when something goes wrong, maybe that could be looked at?

@poikilotherm
Copy link
Contributor

poikilotherm commented Jul 28, 2020

@kcondon I extended the error handling stuff in #7136. Do you wish for me to add a more meaningfull error message for error 406?

@kcondon
Copy link
Contributor

kcondon commented Jul 28, 2020

@poikilotherm Is there something more meaningful? If it doesn't add value, then no. If it provides context or potentially a direction for action, then yes. I trust your judgement. In the case for errors in this pr, it literally just says error, when there are clearly different causes and I am error prone when first using an API. :(

@pdurbin
Copy link
Member Author

pdurbin commented Jul 28, 2020

@kcondon is right. Only "ERROR" is seen like this...

{
    "status": "ERROR"
}

... if a guest user tries to download from a dataset with multiple unpublished files.

Additionally, I can get this error (I'm using PIDs instead of database IDs)...

{
    "status": "ERROR",
    "code": 403,
    "message": "'/api/v1/access/dataset/%3ApersistentId' you are not authorized to access this object via this api endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org."
}

... if the guest user (no API token) attempts to use the new "download by dataset" endpoint on a dataset with a single restricted file.

Let me see what I can do.

Return UNAUTHORIZED instead of BAD_REQUEST and detailed error messages.
@pdurbin
Copy link
Member Author

pdurbin commented Jul 28, 2020

@kcondon et al, good catch on the error handling. I wasn't using the WrappedResponse object properly but as of fc7df59 I am. This means that instead of always sending BAD_REQUEST (400), you should see UNAUTHORIZED (401) as appropriate. You should also see more detailed messages like these:

  • User :guest is not permitted to perform requested action.
  • Bad api key

I updated the tests to reflect these and others.

I didn't attempt to do anything with the "you are not authorized to access this object via this api endpoint" message because it's pre-existing and I'm trying to be mindful of scope. If I call into the older "datafiles" (plural) API on which this new "download by dataset" API is build, you get a similar message. Here's an attempt to use that API by the guest user to download a single restricted file:

curl http://localhost:8080/api/access/datafiles/4216

{
  "status": "ERROR",
  "code": 403,
  "message": "'/api/v1/access/datafiles/4216' you are not authorized to access this object via this api endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org."
}

Furthermore, this error at least conveys "not authorized" and gives FORBIDDEN (403) so I think it's ok.

I'm moving this back to code review. I'm aware that there's more conversation above about error handling as well as in #7134 and #7136 but I imagine it'll be handled in separate issues and pull requests.

@pdurbin pdurbin moved this from IQSS Team - In Progress 💻 to Code Review 🦁 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 28, 2020
@pdurbin pdurbin removed their assignment Jul 28, 2020
@sekmiller sekmiller self-assigned this Jul 29, 2020
@sekmiller sekmiller removed their assignment Jul 29, 2020
@kcondon kcondon self-assigned this Jul 29, 2020
@kcondon kcondon merged commit 2613d4e into develop Jul 29, 2020
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA 🔎✅ to Done 🚀 Jul 29, 2020
@kcondon kcondon deleted the 4529-download-all-api branch July 29, 2020 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Data Access API: Download by Dataset
9 participants